Abstract. With the increasing prevalence of location-aware
devices, trajectory data has been generated and collected in
various application domains. Trajectory data carries rich
information that is useful for many data analysis tasks. Yet,
improper publishing and use of trajectory data could jeopardize
individual privacy, and it has been shown that existing
privacy-preserving trajectory data publishing methods derived
from partition-based privacy models, for example $k$-anonymity,
are unable to provide sufficient privacy protection.

In this paper, motivated by the data publishing scenario at the 
Societe de transport de Montreal (STM), the public transit agency 
in the Montreal area, we study the problem of publishing trajectory
data under the rigorous differential privacy model. We propose 
an efficient data-dependent yet differentially private sanitization 
algorithm, which is applicable to different types of trajectory 
data. The efficiency of our approach comes from adaptively 
narrowing down the output domain by building a noisy prefix 
tree based on the underlying data. Moreover, as a post-processing 
step, we make use of the inherent constraints of a prefix tree to 
conduct constrained inferences, which lead to better utility. This 
is the first paper to introduce a practical solution for publishing 
large volumes of trajectory data under differential privacy. We
examine the utility of sanitized data in terms of count queries 
and frequent sequential pattern mining. Extensive experiments 
on real-life trajectory data from the STM demonstrate that our 
approach maintains high utility and is scalable to large trajectory 
datasets. 

I. Introduction 

Over the last few years, location-aware devices, such as 
RFID tags, cell phones, GPS navigation systems, and point of 
sale (POS) terminals, have been widely deployed in various 
application domains. Such devices generate large volumes of
trajectory data that could be used for many important data
analysis tasks, such as marketing analysis Qj, long-term
network planning [2], customer behavior analysis [3], and demand
forecasting [1]. However, trajectory data often contains
sensitive personal information, and improper publishing and use
of trajectory data may violate individual privacy. The privacy
concern of publishing trajectory data is best exemplified by
the case of the Societe de transport de Montreal (STM,
http://www.stm.info), the public transit agency in the Montreal
area.

In 2007, the STM deployed the smart card automated fare
collection (SCAFC) system as a secure method of user
validation and fare collection. In addition to revenue collection,
the system generates and collects passengers' trajectory data
every day. Transit information, such as smart card ID and
station ID, is collected when a passenger swipes his smart
card at a SCAFC system, and is then stored in a central
database management system, where the transit information of
a passenger is organized as an ordered list of stations, a kind
of trajectory data (see a formal definition in Section III-A).
Periodically, the IT department of the STM shares such 
trajectory data with other departments, e.g., the marketing 
department, for basic data analysis, and publishes its trajectory 
data to external research institutions for more complex data 
mining tasks. According to the preliminary research Q, flU,
[5], the STM can substantially benefit from trajectory data
analysis at strategic, tactical, and operational levels. Yet, it
has also realized that the nature of trajectory data is raising
major privacy concerns on the part of card users in information
sharing @]. This has been an obstacle to conducting further
trajectory data analysis tasks and even to performing regular
commercial operations. Similarly, many other sectors, for example
cell phone communication and credit card payment [6], have
been facing the dilemma between trajectory data publishing and
individual privacy protection.

The privacy concern in trajectory data sharing has spawned
some research [7], [8], [9], [10], [11], [12] on
privacy-preserving trajectory data publishing based on partition-based
privacy models [13], for example $k$-anonymity [14] (or
$(k, \delta)$-anonymity [7]) and confidence bounding [8], [11]. However,
many types of privacy attacks, such as the composition attack [13],
the deFinetti attack [15], and the foreground knowledge attack [16],
have been identified on the approaches derived using the
partition-based privacy models, demonstrating their
vulnerability to an adversary's background knowledge. Due to the
deterministic nature of partition-based privacy models, it is
foreseeable that more types of privacy attacks could be
discovered on these privacy models in the future. Consequently, in
recent years differential privacy [17] has become the de facto
successor to partition-based privacy models. Differential
privacy provides provable privacy guarantees independent of an
adversary's background knowledge and computational power
(this claim may not be valid in some special cases [18], but
is still correct for our scenario, as discussed in Section IV-C).
Differential privacy requires that any computation based on
the underlying database should be insensitive to the change of
a single record. Therefore, a record owner can be assured that
any privacy breach would not be a result of participating in a
database.

The traditional non-interactive approaches [19], [20], [21],
[22] for generating differentially private releases are
data-independent in the sense that all possible entries in the output
domain need to be explicitly considered no matter what the
underlying database is. For high-dimensional data, such as
trajectory data, this is computationally infeasible. Consider a
trajectory database $\mathcal{D}$ with all locations drawn from a universe
of size $m$. Suppose the maximum length of trajectories (the
number of locations in a trajectory) in $\mathcal{D}$ is $l$. These approaches
need to generate $\sum_{i=1}^{l} m^i = \frac{m^{l+1} - m}{m - 1}$ output entries. For a
trajectory database with $m = 1{,}000$ and $l = 20$, this requires
generating about $10^{60}$ entries. Hence, these approaches are not
computationally applicable with today's systems to real-life
trajectory databases.
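
The size of this output domain can be checked directly. The following is a minimal sketch; the function name output_domain_size is a hypothetical helper introduced only for this illustration.

```python
def output_domain_size(m: int, l: int) -> int:
    """Number of distinct trajectories of length 1..l over m locations."""
    return sum(m ** i for i in range(1, l + 1))


m, l = 1_000, 20
size = output_domain_size(m, l)

# Closed form of the geometric series: (m^(l+1) - m) / (m - 1).
closed_form = (m ** (l + 1) - m) // (m - 1)

print(size == closed_form)   # True
print(len(str(size)) - 1)    # 60, i.e. the size is on the order of 10^60
```

Even at one billion entries per second, enumerating a domain of this size is far out of reach, which is why a data-independent release is impractical for trajectory data.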

Two very recent papers [23], [24] point out that more
efficient and more effective solutions could be achieved by
carefully making use of the underlying database. We call such
solutions data-dependent. The general idea of data-dependent
solutions is to adaptively narrow down the output domain by
using noisy answers obtained from the underlying database.
However, the methods in [23], [24] cannot be applied to
trajectory data for two reasons. First, the methods in [23], [24]
require taxonomy trees to guide the data publication process.
For trajectory data, there does not exist a logical taxonomy
tree due to the sequentiality among locations. Second, the
methods only work for sets, yet a trajectory may contain a
bag of locations. Therefore, non-trivial efforts are needed to
develop a differentially private data publishing approach for
trajectory data.

Protecting individual privacy is one aspect of sanitizing
trajectory data. Another equally important aspect is preserving
utility in sanitized data for data analysis. Motivated by the
STM case, in this paper, we are particularly interested in two
kinds of data mining tasks, namely count queries (see a formal
definition in Section III-D) and frequent sequential pattern
mining [25]. Count queries, as a general data analysis task, are
the building block of many more advanced data mining tasks.
In the STM scenario, with accurate answers to count queries
over sanitized data, data recipients can obtain the answers
to questions such as "how many passengers have visited
both stations Guy-Concordia and McGill¹ within the last
week". Frequent sequential pattern mining, as a concrete data
mining task, helps, for example, the STM better understand
passengers' transit patterns and consequently allows the STM
to adjust its network geometry and schedules in order to
better utilize its existing resources. These utility requirements
naturally demand a solution that publishes data, not merely
data mining results.

¹ Guy-Concordia and McGill are two metro stations on the green line of
the Montreal metro network.

Contribution. In this paper, we study the problem of
publishing trajectory data that simultaneously protects individual
privacy under the framework of differential privacy and
provides high utility for different data mining tasks. This is the
first paper that introduces a practical solution for publishing
large volumes of real-life trajectory data via differential privacy.
The previous works [7], [8], [9], [10], [11], [12] on
privacy-preserving trajectory data publishing cannot be used to achieve
differential privacy because of their deterministic nature. We
summarize the major contributions of the paper as follows.

• We propose a non-interactive data-dependent sanitization
algorithm of runtime complexity $O(|\mathcal{D}| \cdot |\mathcal{L}|)$ to generate
a differentially private release for trajectory data, where
$|\mathcal{D}|$ is the size of the underlying database $\mathcal{D}$ and $|\mathcal{L}|$ is the
size of the location universe. The efficiency is achieved by
constructing a noisy prefix tree, which adaptively guides
the algorithm to circumvent certain output sub-domains
based on the underlying database.

• We design a statistical process for efficiently constructing
a noisy prefix tree under the Laplace mechanism. This is
vital to the scalability of processing datasets with large
location universe sizes.

• We make use of two sets of inherent constraints of a
prefix tree to conduct constrained inferences, which helps
generate a more accurate release. This is the first paper
to apply constrained inferences to non-interactive data
publishing.

• We conduct an extensive experimental study over the
real-life trajectory dataset from the STM. We examine
the utility of sanitized data for two different data mining
tasks, namely count queries (a generic data analysis
task) and frequent sequential pattern mining (a concrete
data mining task). We demonstrate that our approach
maintains high utility and is scalable to large volumes of
real-life trajectory data.

The rest of the paper is organized as follows. Section II
reviews related work. Section III provides the preliminaries
for our solution. A two-stage sanitization algorithm for
trajectory data is proposed in Section IV, and comprehensive
experimental results are presented in Section V. Finally, we
conclude the paper in Section VI.

II. Related Work 

In this section, we review the state of the art of privacy- 
preserving trajectory data publishing techniques and recent 
applications of differential privacy. 

A. Privacy-Preserving Trajectory Data Publishing 

Due to the ubiquity of trajectory data, some recent
works [7], [8], [9], [10], [11], [12] have started to study
privacy-preserving trajectory data publishing from different
perspectives. Abul et al. [7] propose the $(k, \delta)$-anonymity
model based on the inherent imprecision of sampling and
positioning systems, where $\delta$ represents the possible location
imprecision. The general idea of [7] is to modify trajectories
by space translation so that $k$ different trajectories co-exist in
a cylinder of radius $\delta$. Terrovitis and Mamoulis [8] model
an adversary's background knowledge as a set of projections of
trajectories in a trajectory database, and consequently propose
a data suppression technique that limits the confidence of
inferring the presence of a location in a trajectory to a
pre-defined probability threshold. Yarovoy et al. [9] propose to
$k$-anonymize a moving object database (MOD) by considering
timestamps as the quasi-identifiers (QIDs). Adversaries are
assumed to launch privacy attacks based on attack graphs.
Their approach first identifies anonymization groups and then
generalizes the groups to common regions according to the
QIDs while achieving minimal information loss. Monreale et
al. [12] present an approach based on spatial generalization in
order to achieve $k$-anonymity. The novelty of their approach
lies in a generalization scheme that depends on the underlying
trajectory dataset rather than a fixed grid hierarchy.

Hu et al. [10] present the problem of $k$-anonymizing a
trajectory database with respect to a sensitive event database.
The goal is to make sure that every event is shared by at
least $k$ users. Specifically, they develop a new generalization
mechanism known as local enlargement, which achieves better
utility than conventional hierarchy- or partition-based
generalization. Chen et al. [11] consider the emerging trajectory
data publishing scenario, in which users' sensitive attributes
are published with trajectory data, and consequently propose
the $(K, C)_L$-privacy model that thwarts both identity linkages
on trajectory data and attribute linkages via trajectory data.
They develop a generic solution for various data utility metrics
by use of local suppression. All these approaches [7], [8],
[9], [10], [11], [12] are built on partition-based privacy
models, and therefore are not able to provide sufficient
privacy protection for trajectory data. The major contribution of
our paper is the use of differential privacy, which provides
significantly stronger privacy guarantees.

B. Applications of Differential Privacy 

In the last few years, differential privacy has been employed
in various applications. Currently most of the research on
differential privacy concentrates on the interactive setting with
the goal of either reducing the magnitude of added noise [26],
[27], [28], [29] or releasing certain data mining results [30],
[31], [32], [33], [34]. Dwork [35] provides an overview of
recent works on differential privacy.

The works closest to ours are by Blum et al. [19], Dwork
et al. [20], Xiao et al. [21], Xiao et al. [22], Mohammed et
al. [23], and Chen et al. [24]. All these works consider
non-interactive data publishing under differential privacy. Blum et
al. [19] demonstrate that it is possible to release synthetic
private databases that are useful for all queries over a
discretized domain from a concept class with polynomial
Vapnik-Chervonenkis dimension². However, their mechanism is not
efficient, taking runtime complexity of $superpoly(|C|, |I|)$,
where $|C|$ is the size of a concept class and $|I|$ the size of
the universe. Dwork et al. [20] propose a recursive algorithm



TABLE I
NOTATIONAL CONVENTIONS

Symbol                                            Description
$\mathcal{A}$                                     A privacy mechanism
$\Delta f$                                        The global sensitivity of the function $f$
$\epsilon, \bar{\epsilon}$                        The total privacy budget, a portion of privacy budget
$\mathcal{L}, L_i$                                The location universe, a location in the universe
$T, t_i$                                          A trajectory, a location in a trajectory
$ls(T)$                                           The set of locations in trajectory $T$
$\mathcal{D}, \tilde{\mathcal{D}}$                A trajectory database, a sanitized database of $\mathcal{D}$
$Q(\mathcal{D})$                                  A count query over the database $\mathcal{D}$
$\mathcal{PT}$                                    A prefix tree
$Root(\mathcal{PT})$                              The virtual root of the prefix tree $\mathcal{PT}$
$prefix(v, \mathcal{PT})$                         The prefix represented by the node $v$ of $\mathcal{PT}$
$tr(v)$                                           The set of trajectories with the prefix $prefix(v, \mathcal{PT})$
$\tilde{c}(v), \bar{c}(v), \hat{c}(v)$            The noisy count, intermediate estimate and consistent
                                                  estimate of $|tr(v)|$ respectively
$s$                                               Sanity bound
$\mathcal{F}_k(\mathcal{D})$                      The top-$k$ most frequent sequential patterns of $\mathcal{D}$
$|\mathcal{L}|, |T|, |\mathcal{D}|$               The size of the location universe, a trajectory, and
                                                  a trajectory database respectively
$k$                                               The number of empty nodes that pass the boolean tests

² Vapnik-Chervonenkis dimension is a measure of the complexity of a
concept in the class.



of generating a synthetic database with runtime complexity of
$poly(|C|, |I|)$. This improvement, however, is still insufficient
to handle real-life trajectory datasets due to the exponential
size of $|C|$. Xiao et al. [21] propose a wavelet-transformation
based approach for relational data to lower the magnitude of
noise, rather than adding independent Laplace noise. Xiao et
al. [22] propose a two-step algorithm for relational data. It
first issues queries for every possible combination of attribute
values to the PINQ interface [36], and then produces a
generalized output based on the perturbed results. Similarly,
the algorithms [21], [22] need to process all possible entries in
the entire output domain, giving rise to a scalability problem
in the context of trajectory data.

Two very recent papers [23], [24] point out that
data-dependent approaches are more efficient and more effective
for generating a differentially private release. Mohammed et
al. [23] propose a generalization-based sanitization algorithm
for relational data with the goal of classification analysis.
Chen et al. [24] propose a probabilistic top-down partitioning
algorithm for set-valued data. Both approaches [23], [24]
make use of taxonomy trees to adaptively narrow down the
output domain. However, due to the reasons mentioned in
Section I, they cannot be applied to trajectory data, in which
sequentiality is a major concern.

III. Preliminaries 

In this section, we define a trajectory database and a 
prefix tree, review differential privacy, and present the utility 
requirements. The notational conventions are summarized in 
Table I.

A. Trajectory Database

Let $\mathcal{L} = \{L_1, L_2, \cdots, L_{|\mathcal{L}|}\}$ be the universe of locations,
where $|\mathcal{L}|$ is the size of the universe. Without loss of generality,
we consider locations as discrete spatial areas in a map. For
example, in the STM case, $\mathcal{L}$ represents all stations in the STM
transportation network. This assumption also applies to many



TABLE II
SAMPLE TRAJECTORY DATABASE

Rec. #    Path
1         $L_1 \to L_2 \to L_3$
2         $L_1 \to L_2$
3         $L_3 \to L_2 \to L_1$
4         $L_1 \to L_2 \to L_4$
5         $L_1 \to L_2 \to L_3$
6         $L_3 \to L_2$
7         $L_1 \to L_2 \to L_4 \to L_1$
8         $L_3 \to L_1$


other types of trajectory data, e.g., purchase records, where
a location is a store's address. We model a trajectory as an
ordered list of locations drawn from the universe.

Definition 3.1 (Trajectory): A trajectory $T$ of length $|T|$ is
an ordered list of locations $T = t_1 \to t_2 \to \cdots \to t_{|T|}$, where
$\forall 1 \le i \le |T|$, $t_i \in \mathcal{L}$. ■

A location may occur multiple times in $T$, and may occur
consecutively in $T$. Therefore, given $\mathcal{L} = \{L_1, L_2, L_3, L_4\}$,
$T = L_1 \to L_2 \to L_2$ is a valid trajectory. In some cases,
a trajectory may include timestamps. We point out that our
approach also works for this type of trajectory data and discuss
the details in Section IV-D. A trajectory database is composed
of a multiset of trajectories; each trajectory represents the
movement history of a record owner. A formal definition is
as follows.

Definition 3.2 (Trajectory Database): A trajectory
database $\mathcal{D}$ of size $|\mathcal{D}|$ is a multiset of trajectories
$\mathcal{D} = \{D_1, D_2, \cdots, D_{|\mathcal{D}|}\}$. ■

Table II presents a sample trajectory database with $\mathcal{L} =
\{L_1, L_2, L_3, L_4\}$. In the rest of the paper, we use the terms
database and dataset interchangeably.

B. Prefix Tree 

A trajectory database can be represented in a more compact
way in terms of a prefix tree. A prefix tree groups trajectories
with the same prefix into the same branch. We first define a
prefix of a trajectory below.

Definition 3.3 (Trajectory Prefix): A trajectory $S = s_1 \to
s_2 \to \cdots \to s_{|S|}$ is a prefix of a trajectory $T = t_1 \to t_2 \to
\cdots \to t_{|T|}$, denoted by $S \preceq T$, if and only if $|S| \le |T|$ and
$\forall 1 \le i \le |S|$, $s_i = t_i$. ■

For example, $L_1 \to L_2$ is a prefix of $L_1 \to L_2 \to L_4 \to L_3$,
but $L_1 \to L_4$ is not. Note that a trajectory prefix is a trajectory
per se. Next, we formally define a prefix tree below.

Definition 3.4 (Prefix Tree): A prefix tree $\mathcal{PT}$ of a
trajectory database $\mathcal{D}$ is a triplet $\mathcal{PT} = (V, E, Root(\mathcal{PT}))$, where
$V$ is the set of nodes labeled with locations, each
corresponding to a unique trajectory prefix in $\mathcal{D}$; $E$ is the set of edges,
representing transitions between nodes; $Root(\mathcal{PT}) \in V$ is the
virtual root of $\mathcal{PT}$. The unique trajectory prefix represented
by a node $v \in V$, denoted by $prefix(v, \mathcal{PT})$, is an ordered
list of locations starting from $Root(\mathcal{PT})$ to $v$. ■




Fig. 1. The prefix tree of the sample data 

Each node $v \in V$ of $\mathcal{PT}$ keeps a doublet in the form of
$\langle tr(v), \tilde{c}(v) \rangle$, where $tr(v)$ is the set of trajectories in $\mathcal{D}$ having
the prefix $prefix(v, \mathcal{PT})$, that is, $\{D \in \mathcal{D} : prefix(v, \mathcal{PT}) \preceq
D\}$, and $\tilde{c}(v)$ is a noisy version of $|tr(v)|$ (e.g., $|tr(v)|$ plus
Laplace noise). $tr(Root(\mathcal{PT}))$ contains all trajectories in $\mathcal{D}$.
We call the set of all nodes of $\mathcal{PT}$ at a given depth $i$ a level
of $\mathcal{PT}$, denoted by $level(i, \mathcal{PT})$. $Root(\mathcal{PT})$ is at depth zero.
Figure 1 illustrates the prefix tree of the sample database in
Table II, where each node $v$ is labeled with its location and
$|tr(v)|$.
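
The structure defined above can be sketched in a few lines of code. This is only an illustration of a deterministic (noise-free) prefix tree: the names Node, build_prefix_tree, and level are hypothetical, and the toy database is made up for the example rather than taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    location: Optional[str]                             # None marks the virtual root
    trajectories: list = field(default_factory=list)    # tr(v)
    children: dict = field(default_factory=dict)        # location -> child Node


def build_prefix_tree(database):
    """Group trajectories that share a prefix into the same branch."""
    root = Node(location=None, trajectories=list(database))
    for trajectory in database:
        node = root
        for location in trajectory:
            node = node.children.setdefault(location, Node(location=location))
            node.trajectories.append(trajectory)
    return root


def level(root, depth):
    """All nodes at the given depth; the virtual root is at depth zero."""
    nodes = [root]
    for _ in range(depth):
        nodes = [child for node in nodes for child in node.children.values()]
    return nodes


# Hypothetical toy database over the universe {L1, L2, L3}.
db = [("L1", "L2"), ("L1", "L2", "L3"), ("L3", "L1"), ("L1", "L2", "L2")]
root = build_prefix_tree(db)
for node in level(root, 1):
    print(node.location, len(node.trajectories))   # L1 3, then L3 1
```

Each node stores the trajectories sharing its prefix, so the printed value is the exact count $|tr(v)|$; the noisy count is what the sanitization algorithm attaches on top of this structure.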

C. Differential Privacy 

Differential privacy is a relatively new privacy model stem- 
ming from the field of statistical disclosure control. Differen- 
tial privacy, in general, requires that the removal or addition 
of a single database record does not significantly affect the 
outcome of any analysis based on the database. Therefore, 
for a record owner, any privacy breach will not be a result of 
participating in the database since anything that can be learned 
from the database with his record can also be learned from the 
one without his record. We formally define differential privacy 
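in the non-interactive setting [19] as follows.

Definition 3.5 ($\epsilon$-differential privacy): A non-interactive
privacy mechanism $\mathcal{A}$ gives $\epsilon$-differential privacy if for any
databases $\mathcal{D}_1$ and $\mathcal{D}_2$ differing on at most one record, and for
any possible sanitized database $\tilde{\mathcal{D}} \in Range(\mathcal{A})$,

$\Pr[\mathcal{A}(\mathcal{D}_1) = \tilde{\mathcal{D}}] \le e^{\epsilon} \times \Pr[\mathcal{A}(\mathcal{D}_2) = \tilde{\mathcal{D}}]$    (1)

where the probability is taken over the randomness of $\mathcal{A}$. ■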

Two principal techniques for achieving differential
privacy have appeared in the literature, namely the Laplace
mechanism [17] and the exponential mechanism [37]. A fundamental
concept of both techniques is the global sensitivity of a
function [17] mapping underlying databases to (vectors of)
reals.

Definition 3.6 (Global Sensitivity): For any function $f :
\mathcal{D} \to \mathbb{R}^d$, the sensitivity of $f$ is

$\Delta f = \max_{\mathcal{D}_1, \mathcal{D}_2} \|f(\mathcal{D}_1) - f(\mathcal{D}_2)\|_1$    (2)

for all $\mathcal{D}_1$, $\mathcal{D}_2$ differing in at most one record. ■



Functions with lower sensitivity are more tolerant towards 
changes of a database and, therefore, allow more accurate 
differentially private mechanisms. 

Laplace Mechanism. For the analysis whose outputs are real,
a standard mechanism to achieve differential privacy is to
add Laplace noise to the true output of a function. Dwork et
al. [17] propose the Laplace mechanism which takes as inputs
a database $\mathcal{D}$, a function $f$, and the privacy parameter $\epsilon$. The
noise is generated according to a Laplace distribution with the
probability density function (pdf) $p(x \mid \lambda) = \frac{1}{2\lambda} e^{-|x|/\lambda}$, where
$\lambda$ is determined by both $\Delta f$ and the desired privacy parameter
$\epsilon$.

Theorem 3.1: For any function $f : \mathcal{D} \to \mathbb{R}^d$, the mechanism
$\mathcal{A}$

$\mathcal{A}(\mathcal{D}) = f(\mathcal{D}) + Laplace(\Delta f / \epsilon)$    (3)

gives $\epsilon$-differential privacy. ■

For example, for a single count query $Q$ over a dataset
$\mathcal{D}$, returning $Q(\mathcal{D}) + Laplace(1/\epsilon)$ maintains $\epsilon$-differential
privacy because a count query has a sensitivity of 1.
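
As an illustration of Theorem 3.1 for a count query, the following minimal sketch adds Laplace noise of scale $1/\epsilon$ to a true count. The helper names laplace_noise and noisy_count are assumptions made for this example, and the sampler relies on the standard fact that the difference of two i.i.d. exponential variables is Laplace distributed.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from a zero-centred Laplace distribution.

    The difference of two i.i.d. exponential variables with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count: int, epsilon: float, rng: random.Random,
                sensitivity: float = 1.0) -> float:
    """Return true_count + Laplace(sensitivity / epsilon)."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)
samples = [noisy_count(100, epsilon=1.0, rng=rng) for _ in range(20_000)]
mean = sum(samples) / len(samples)
variance = sum((x - mean) ** 2 for x in samples) / len(samples)
print(round(mean, 2), round(variance, 2))   # close to 100 and to 2 / epsilon^2
```

A smaller $\epsilon$ gives a larger noise scale and therefore stronger privacy at the cost of less accurate counts.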

Exponential Mechanism. For the analysis whose outputs
are not real or make no sense after adding noise, McSherry
and Talwar [37] propose the exponential mechanism that
selects an output from the output domain, $r \in \mathcal{R}$, by taking
into consideration its score of a given utility function $q$ in
a differentially private manner. The exponential mechanism
assigns exponentially greater probabilities of being selected
to outputs of higher scores so that the final output would be
close to the optimum with respect to $q$. The chosen utility
function $q$ should be insensitive to changes of any particular
record, that is, have a low sensitivity. Let the sensitivity of $q$
be $\Delta q = \max_{\forall r, \mathcal{D}_1, \mathcal{D}_2} |q(\mathcal{D}_1, r) - q(\mathcal{D}_2, r)|$.

Theorem 3.2: Given a utility function $q : (\mathcal{D} \times \mathcal{R}) \to \mathbb{R}$ for
a database $\mathcal{D}$, the mechanism $\mathcal{A}$,

$\mathcal{A}(\mathcal{D}, q) = \left\{ \text{return } r \text{ with probability} \propto \exp\left(\frac{\epsilon q(\mathcal{D}, r)}{2 \Delta q}\right) \right\}$    (4)

gives $\epsilon$-differential privacy. ■
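
The selection rule of the exponential mechanism can be sketched as follows. This is a hedged illustration using the common form with weight $\exp(\epsilon q / (2 \Delta q))$; the function names and the utility scores are hypothetical and are not part of the paper's algorithm.

```python
import math
import random


def exponential_mechanism_probs(scores, epsilon, sensitivity):
    """Selection probabilities proportional to exp(eps * q / (2 * sensitivity))."""
    # Subtracting the maximum score keeps exp() numerically stable and
    # does not change the normalized probabilities.
    top = max(scores.values())
    weights = {r: math.exp(epsilon * (q - top) / (2.0 * sensitivity))
               for r, q in scores.items()}
    total = sum(weights.values())
    return {r: w / total for r, w in weights.items()}


def exponential_mechanism(scores, epsilon, sensitivity, rng):
    """Pick one output r at random according to the probabilities above."""
    probs = exponential_mechanism_probs(scores, epsilon, sensitivity)
    outputs = list(probs)
    return rng.choices(outputs, weights=[probs[r] for r in outputs], k=1)[0]


# Hypothetical utility scores q(D, r) for three candidate outputs.
scores = {"L1": 5.0, "L2": 3.0, "L3": 0.0}
probs = exponential_mechanism_probs(scores, epsilon=1.0, sensitivity=1.0)
choice = exponential_mechanism(scores, 1.0, 1.0, random.Random(0))
print(probs, choice)
```

Outputs with higher scores are exponentially more likely to be chosen, but every output keeps a non-zero probability, which is what makes the selection differentially private.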

Composition Property. For a sequence of computations, its
privacy guarantee is provided by the composition properties.
Any sequence of computations that each provides
differential privacy in isolation also provides differential privacy in
sequence, which is known as sequential composition [36].

Theorem 3.3: Let $\mathcal{A}_i$ each provide $\epsilon_i$-differential privacy.
A sequence of $\mathcal{A}_i(\mathcal{D})$ over the database $\mathcal{D}$ provides
$(\sum_i \epsilon_i)$-differential privacy. ■

In some special cases, in which a sequence of computations
is conducted on disjoint databases, the privacy cost does not
accumulate, but depends only on the worst guarantee of all
computations. This is known as parallel composition [36]. This
property could and should be used to obtain good performance.

Theorem 3.4: Let $\mathcal{A}_i$ each provide $\epsilon_i$-differential privacy. A
sequence of $\mathcal{A}_i(\mathcal{D}_i)$ over a set of disjoint datasets $\mathcal{D}_i$ provides
$(\max_i(\epsilon_i))$-differential privacy. ■

D. Utility Requirements 

The sanitized data is mainly used to perform two different
data mining tasks, namely count queries and frequent sequential
pattern mining [25]. Count queries, as a general data analysis
task, are the building block of many data mining tasks. We
formally define count queries over a trajectory database below.

Definition 3.7 (Count Query): For a given set of locations
$L$ drawn from the universe $\mathcal{L}$, a count query $Q$ over a database
$\mathcal{D}$ is defined to be $Q(\mathcal{D}) = |\{D \in \mathcal{D} : L \subseteq ls(D)\}|$, where
$ls(D)$ returns the set of locations in $D$. ■

Note that sequentiality among locations is not considered in
count queries, because the major users of count queries are,
for example, the personnel of the marketing department of the
STM, who are merely interested in users' presence in certain
stations for marketing analysis, but not the sequentiality of
visiting. Instead, the preservation of sequentiality in sanitized
data is examined by frequent sequential pattern mining. We
measure the utility of a count query over the sanitized database
$\tilde{\mathcal{D}}$ by its relative error [21], [29], [24] with respect to the true
result over the original database $\mathcal{D}$, which is computed as:

$\frac{|Q(\tilde{\mathcal{D}}) - Q(\mathcal{D})|}{\max\{Q(\mathcal{D}), s\}},$

where $s$ is a sanity bound used to mitigate the influences of
the queries with extremely small selectivities [21], [29], [24].
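
A count query and its relative error can be sketched as below. The function names and the two toy databases are assumptions made for illustration; the second database merely stands in for a sanitized release and was not produced by the paper's algorithm.

```python
def count_query(database, locations):
    """Number of trajectories that contain every location in `locations`."""
    wanted = set(locations)
    return sum(1 for trajectory in database if wanted <= set(trajectory))


def relative_error(true_answer, sanitized_answer, sanity_bound):
    """|Q(sanitized) - Q(original)| / max(Q(original), sanity_bound)."""
    return abs(sanitized_answer - true_answer) / max(true_answer, sanity_bound)


# Hypothetical original and sanitized databases.
original = [("L1", "L2", "L3"), ("L1", "L2"), ("L3", "L1")]
sanitized = [("L1", "L2"), ("L1", "L2", "L4"), ("L3", "L1"), ("L1", "L2")]

query = ["L1", "L2"]
true_answer = count_query(original, query)         # 2
sanitized_answer = count_query(sanitized, query)   # 3
print(relative_error(true_answer, sanitized_answer, sanity_bound=1))   # 0.5
```

The sanity bound matters for queries that match almost nothing: with a true answer of 0 the denominator would otherwise be zero, and with very small true answers a tiny absolute error would dominate the average.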

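For frequent sequential pattern mining, we measure the
utility of sanitized data in terms of true positive, false positive
and false drop [38]. Given a positive number $k$, we denote
the set of top-$k$ most frequent sequential patterns identified
on the original database $\mathcal{D}$ by $\mathcal{F}_k(\mathcal{D})$ and the set of frequent
sequential patterns on the sanitized database $\tilde{\mathcal{D}}$ by $\mathcal{F}_k(\tilde{\mathcal{D}})$.
True positive is the number of frequent sequential patterns
in $\mathcal{F}_k(\mathcal{D})$ that are correctly identified in $\mathcal{F}_k(\tilde{\mathcal{D}})$, that is,
$|\mathcal{F}_k(\mathcal{D}) \cap \mathcal{F}_k(\tilde{\mathcal{D}})|$. False positive is defined to be the number
of infrequent sequential patterns in $\mathcal{D}$ that are mistakenly
included in $\mathcal{F}_k(\tilde{\mathcal{D}})$, that is,

$|\mathcal{F}_k(\tilde{\mathcal{D}}) - \mathcal{F}_k(\mathcal{D}) \cap \mathcal{F}_k(\tilde{\mathcal{D}})|.$

False drop is defined to be the number of frequent sequential
patterns in $\mathcal{F}_k(\mathcal{D})$ that are wrongly omitted in $\mathcal{F}_k(\tilde{\mathcal{D}})$, that is,

$|\mathcal{F}_k(\mathcal{D}) \cup \mathcal{F}_k(\tilde{\mathcal{D}}) - \mathcal{F}_k(\tilde{\mathcal{D}})|.$

Since in our setting $|\mathcal{F}_k(\mathcal{D})| = |\mathcal{F}_k(\tilde{\mathcal{D}})| = k$, false positives
always equal false drops.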
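
These three measures are plain set operations. The sketch below assumes the two top-$k$ pattern sets are already available (the mining step itself is not shown); the function name and the sample patterns are hypothetical.

```python
def pattern_utility(original_top_k, sanitized_top_k):
    """Return (true positives, false positives, false drops)."""
    original, sanitized = set(original_top_k), set(sanitized_top_k)
    common = original & sanitized
    true_positive = len(common)
    false_positive = len(sanitized - common)                 # only in the sanitized top-k
    false_drop = len((original | sanitized) - sanitized)     # missing from the sanitized top-k
    return true_positive, false_positive, false_drop


# Hypothetical top-3 sequential patterns, each written as a tuple of locations.
original_top_k = {("L1", "L2"), ("L2", "L3"), ("L1",)}
sanitized_top_k = {("L1", "L2"), ("L1",), ("L3", "L1")}
print(pattern_utility(original_top_k, sanitized_top_k))   # (2, 1, 1)
```

Because both sets have exactly $k$ elements, each pattern wrongly included displaces exactly one pattern that should have been there, so the last two numbers coincide.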

IV. Sanitization Algorithm 

We first provide an overview of our two-step sanitization
algorithm in Algorithm 1. Given a raw trajectory dataset $\mathcal{D}$,
a privacy budget $\epsilon$, and a specified height of the prefix tree
$h$, it returns a sanitized dataset $\tilde{\mathcal{D}}$ satisfying $\epsilon$-differential
privacy. BuildNoisyPrefixTree builds a noisy prefix tree $\mathcal{PT}$
for $\mathcal{D}$ using a set of count queries; GeneratePrivateRelease
employs a utility boosting technique on $\mathcal{PT}$ and then generates
a differentially private release.

Algorithm 1 Trajectory Data Sanitization Algorithm

Input: Raw trajectory dataset $\mathcal{D}$
Input: Privacy budget $\epsilon$
Input: Height of the prefix tree $h$
Output: Sanitized dataset $\tilde{\mathcal{D}}$

1: Noisy prefix tree $\mathcal{PT}$ ← BuildNoisyPrefixTree($\mathcal{D}$, $\epsilon$, $h$);
2: Sanitized dataset $\tilde{\mathcal{D}}$ ← GeneratePrivateRelease($\mathcal{PT}$);
3: return $\tilde{\mathcal{D}}$;

A. Noisy Prefix Tree Construction 

The noisy prefix tree of the raw trajectory dataset $\mathcal{D}$ cannot
be generated simply by scanning the dataset once, which is
how we would construct a deterministic prefix tree. To satisfy
differential privacy, we need to guarantee that every trajectory
that can be derived from the location universe (whether or not it
is in $\mathcal{D}$) has a certain probability of appearing in the sanitized
dataset so that the sensitive information in $\mathcal{D}$ could be masked.

Our strategy for BuildNoisyPrefixTree is to recursively group
trajectories in $\mathcal{D}$ into disjoint sub-datasets based on their
prefixes and resort to the well-understood query model to
guarantee differential privacy. Procedure 1 presents the details
of BuildNoisyPrefixTree. We first create a prefix tree $\mathcal{PT}$ with
a virtual root $Root(\mathcal{PT})$ (Lines 2-4). To build $\mathcal{PT}$, we employ
a uniform privacy budget allocation scheme, that is, we divide the
total privacy budget $\epsilon$ into equal portions $\bar{\epsilon} = \frac{\epsilon}{h}$, each of which
is used for constructing a level of $\mathcal{PT}$ (Line 5). In Lines 6-19, we
iteratively construct each level of $\mathcal{PT}$ in a noisy way. At each
level, for each node $v$, we consider every location in $\mathcal{L}$ as
$v$'s potential child $u$ in order to satisfy differential privacy.
Our goal is to identify the children that are associated with a
non-zero number of trajectories so that we can continue to
expand them. However, we cannot make decisions based on
true numbers, only on noisy counts. One important observation
is that all such potential children are associated with disjoint
trajectory subsets and therefore $\bar{\epsilon}$ can be used in full for each
$u$ because of Theorem 3.4.
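
The budget accounting behind this scheme can be sketched as follows. The helper names and the concrete numbers (a tree of height 4 with three candidate children per node) are assumptions chosen only for illustration.

```python
def sequential_budget(epsilons):
    """Sequential composition: costs over the same data add up."""
    return sum(epsilons)


def parallel_budget(epsilons):
    """Parallel composition: costs over disjoint data do not accumulate."""
    return max(epsilons)


epsilon, height, candidates_per_node = 1.0, 4, 3
per_level = epsilon / height   # uniform allocation: one equal portion per level

# Within one level, the candidate children cover disjoint sets of trajectories,
# so the level costs only the largest single portion spent on any of them.
level_costs = [parallel_budget([per_level] * candidates_per_node)
               for _ in range(height)]

# Different levels query the same trajectories again, so their costs add up.
total = sequential_budget(level_costs)
print(per_level, total)   # 0.25 1.0
```

The total stays at $\epsilon$ no matter how many candidate children are examined per level, which is why considering every location as a potential child does not exhaust the budget.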

For a dataset with a very large location universe $\mathcal{L}$,
processing all locations explicitly may be slow. We provide
an efficient implementation by separately handling potential
child nodes associated with a non-zero and a zero number of
trajectories (referred to as non-empty nodes and empty nodes
respectively in the following). For a non-empty node $u$,
we add Laplace noise to $|tr(u)|$ and use the noisy answer
$\tilde{c}(u) = NoisyCount(|tr(u)|, \bar{\epsilon})$ to decide if it is non-empty.
If $\tilde{c}(u)$ is greater than or equal to the threshold $\theta = \frac{2\sqrt{2}}{\bar{\epsilon}}$ (two
times the standard deviation of the noise), we deem that $u$ is
"non-empty" and insert $u$ into $\mathcal{PT}$ as $v$'s child. We choose a
relatively large threshold mainly for the sake of efficiency;
it also has a positive impact on utility because
more noisy nodes can be pruned. Since non-empty nodes



Procedure 1 BuildNoisyPrefixTree Procedure

Input: Raw trajectory dataset $\mathcal{D}$
Input: Privacy budget $\epsilon$
Input: Height of the prefix tree $h$
Output: Noisy prefix tree $\mathcal{PT}$

 1: $i = 0$;
 2: Create an empty prefix tree $\mathcal{PT}$;
 3: Insert a virtual root $Root(\mathcal{PT})$ to $\mathcal{PT}$;
 4: Add all trajectories in $\mathcal{D}$ to $tr(Root(\mathcal{PT}))$;
 5: $\bar{\epsilon} = \frac{\epsilon}{h}$;
 6: while $i < h$ do
 7:     for each node $v \in level(i, \mathcal{PT})$ do
 8:         Generate a candidate set of nodes $U$ from $\mathcal{L}$,
            each labeled by a location $L \in \mathcal{L}$;
 9:         for each node $u \in U$ do
10:             Consider $u$ as $v$'s child;
11:             Add the trajectories $D$ in $tr(v)$ s.t.
                $prefix(u, \mathcal{PT}) \preceq D$ to $tr(u)$;
12:             $\tilde{c}(u) = NoisyCount(|tr(u)|, \bar{\epsilon})$;
13:             if $\tilde{c}(u) \ge \theta$ then
14:                 Add $u$ to $\mathcal{PT}$ as $v$'s child;
15:             end if
16:         end for
17:     end for
18:     $i$++;
19: end while
20: return $\mathcal{PT}$;



are typically few in number, this process can be done
efficiently.
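
For intuition, the level-by-level construction can be sketched naively, processing every candidate location explicitly. This is not the efficient implementation described in the text: it treats empty and non-empty candidates alike, and the dictionary-based node layout, the function names, and the toy input are all assumptions made for this example.

```python
import math
import random


def laplace_noise(scale, rng):
    """Zero-centred Laplace sample as a difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def build_noisy_prefix_tree(database, universe, epsilon, height, rng):
    eps_bar = epsilon / height                   # equal budget portion per level
    theta = 2.0 * math.sqrt(2.0) / eps_bar       # twice the noise standard deviation
    root = {"location": None, "trajectories": list(database),
            "noisy_count": None, "children": []}
    current_level = [root]
    for depth in range(height):
        next_level = []
        for node in current_level:
            for location in universe:            # every location is a candidate child
                # All trajectories of `node` share its prefix of length `depth`,
                # so the child's trajectories are those continuing with `location`.
                matching = [t for t in node["trajectories"]
                            if len(t) > depth and t[depth] == location]
                noisy = len(matching) + laplace_noise(1.0 / eps_bar, rng)
                if noisy >= theta:               # keep only candidates deemed non-empty
                    child = {"location": location, "trajectories": matching,
                             "noisy_count": noisy, "children": []}
                    node["children"].append(child)
                    next_level.append(child)
        current_level = next_level
    return root


# Hypothetical input: 1,000 identical trajectories over three locations.
database = [("L1", "L2")] * 1000
tree = build_noisy_prefix_tree(database, ["L1", "L2", "L3"],
                               epsilon=10.0, height=2, rng=random.Random(1))
print([(c["location"], round(c["noisy_count"])) for c in tree["children"]])
```

The inner loop over the whole universe is exactly the cost that separate handling of empty nodes is meant to avoid, since for a large universe almost all candidates are empty.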

For the empty nodes, we need to conduct a series of
independent boolean tests, each of which calculates $NoisyCount(0, \bar{\epsilon})$
to check if it passes $\theta$. The number of empty nodes that pass
$\theta$, $k$, follows the binomial distribution $B(m, p_\theta)$, where $m$ is
the total number of empty nodes we need to check and $p_\theta$ is
the probability for a single experiment to succeed. Inspired by
Cormode et al.'s work [39], we design a statistical process for
the Laplace mechanism to directly extract $k$ empty nodes without
explicitly processing every empty node (in [39], Cormode et
al. design a statistical process for the geometric mechanism [40],
a discretized version of the Laplace mechanism).

Theorem 4.1: Independently conducting $m$ pass/not pass experiments
based on the Laplace mechanism with privacy budget $\bar{\epsilon}$
and threshold $\theta$ is equivalent to the following steps:

1) Generate a value $k$ from the binomial distribution
$B(m, p_\theta)$, where $p_\theta = \frac{\exp(-\bar{\epsilon}\theta)}{2}$.

2) Select $k$ uniformly random empty nodes without replacement, with
noisy counts sampled from the distribution

\[
P(x) =
\begin{cases}
0 & \forall x < \theta \\
1 - \exp(\bar{\epsilon}\theta - \bar{\epsilon}x) & \forall x \ge \theta
\end{cases}
\]

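Fig. 2. The noisy prefix tree of the sample data

Proof. The probability of a single experiment passing the threshold
$\theta$ is

\[
Pr[PASS] = \int_{\theta}^{\infty} \frac{\bar{\epsilon}}{2} \exp(-\bar{\epsilon}x)\,dx
         = \frac{\exp(-\bar{\epsilon}\theta)}{2}.
\]

Since the experiments are independent, the number of successful
experiments, $k$, follows the binomial distribution
$B(m, \frac{\exp(-\bar{\epsilon}\theta)}{2})$. Once $k$ is
determined, we can uniformly at random select $k$ empty nodes. The
probability density function of the noisy counts $x$ for the $k$
empty nodes, conditional on $x \ge \theta$, is:

\[
p(x \mid x \ge \theta) =
\begin{cases}
0 & \forall x < \theta \\
\dfrac{\frac{\bar{\epsilon}}{2}\exp(-\bar{\epsilon}x)}{Pr[PASS]}
  = \bar{\epsilon}\exp(\bar{\epsilon}\theta - \bar{\epsilon}x) & \forall x \ge \theta
\end{cases}
\]

The corresponding cumulative distribution function is:

\[
P(x) =
\begin{cases}
0 & \forall x < \theta \\
\int_{\theta}^{x} \bar{\epsilon}\exp(\bar{\epsilon}\theta - \bar{\epsilon}t)\,dt
  = 1 - \exp(\bar{\epsilon}\theta - \bar{\epsilon}x) & \forall x \ge \theta
\end{cases}
\]

This completes the proof. ■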
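The two steps of Theorem 4.1 can be sketched in Python with the standard library alone. The sketch assumes that the empty candidates are available as a list; the names `binomial` and `sample_passing_empty_nodes` are hypothetical. The binomial draw jumps over failures with geometric gaps so that the work is proportional to the number of successes, and the noisy counts are drawn by inverting the cumulative distribution function $P(x)$, which gives $x = \theta - \ln(1-u)/\bar{\epsilon}$ for uniform $u$.

```python
import math
import random


def binomial(m, p, rng):
    """Draw k ~ B(m, p) by skipping over failures with geometric jumps."""
    if p <= 0.0:
        return 0
    if p >= 1.0:
        return m
    log_q = math.log1p(-p)
    k, pos = 0, 0
    while True:
        u = 1.0 - rng.random()  # uniform on (0, 1]
        pos += int(math.log(u) / log_q) + 1  # trials up to the next success
        if pos > m:
            return k
        k += 1


def sample_passing_empty_nodes(empty_nodes, eps_bar, theta, rng):
    """Return {node: noisy count} for the empty nodes that pass theta."""
    p_theta = math.exp(-eps_bar * theta) / 2.0
    k = binomial(len(empty_nodes), p_theta, rng)
    chosen = rng.sample(empty_nodes, k)  # uniform, without replacement
    # Inverse-CDF sampling from P(x) = 1 - exp(eps_bar * (theta - x)).
    return {
        node: theta - math.log(1.0 - rng.random()) / eps_bar
        for node in chosen
    }
```

With $m$ in the thousands and a small $p_\theta$, only a handful of nodes are materialized, which is where the saving over testing every empty node comes from.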



Selection of h value. A natural choice of $h$ is the maximum
trajectory length of $\mathcal{D}$. However, this choice raises two
problems. First, for most trajectory databases, the maximum
trajectory length is much longer than the average trajectory length
(see Section V-A for two examples), implying that extremely long
trajectories have very small support. Continuing to expand the
prefix tree for such long trajectories is counter-productive: it
produces many noisy trajectories that do not exist in the original
database and therefore results in poor utility. Second, in some
cases, the maximum trajectory length itself may be sensitive. Since
the length of a trajectory in $\mathcal{D}$ could be arbitrarily
large, the sensitivity of the maximum trajectory length is
unbounded. It follows that it is very difficult to design a
differentially private mechanism to obtain a reliable maximum
trajectory length.

Evfimievski et al. [38] provide an insightful observation that can
be used for the selection of $h$. They point out that, in the
context of transaction data, it is typically impossible to make
transactions of size 10 and longer both privacy preserving and
useful (regardless of the underlying dataset). This observation also
applies to trajectory data. Long trajectories carry too much
sensitive information to achieve a reasonable trade-off between
privacy and utility. Theoretically, selecting a relatively small $h$
value has a limited negative effect on the resulting utility. For
count queries, long trajectories have very small support. Moreover,
Procedure 1 does not eliminate entire long trajectories, but just
truncates their trailing parts, and therefore results in even
smaller relative errors. Similarly, for frequent sequential pattern
mining, eliminating the trailing parts of long trajectories with
small support does not significantly affect the resulting frequent
sequential patterns, which are usually short. We experimentally
confirm our analysis in Section V.

Example 4.1: Given the trajectory database $\mathcal{D}$ in Table I,
the height $h = 3$, and the calculated threshold $\theta = 2$, the
construction of a possible noisy prefix tree $\mathcal{PT}$ is
illustrated in Figure 2. A path of $\mathcal{PT}$ may be shorter
than $h$ if it has been considered "empty" before $h$ is reached. ■

B. Private Release Generation 

Based on the noisy prefix tree $\mathcal{PT}$, we can generate the
sanitized database by traversing $\mathcal{PT}$ once in postorder,
calculating the number $n$ of trajectories terminated at each node
$v$ and appending $n$ copies of $prefix(v, \mathcal{PT})$ to the
output. However, due to the noise added to ensure differential
privacy, we may not be able to obtain a meaningful release. For
example, in Figure 2, consider the root-to-leaf path
$Root(\mathcal{PT}) \to L_3 \to L_2 \to L_1$. We have
$\tilde{c}(L_2) > \tilde{c}(L_3)$, which is counterintuitive because
it is not possible, in general, to have more trajectories with the
prefix $prefix(u, \mathcal{PT})$ than trajectories with the prefix
$prefix(v, \mathcal{PT})$, where $u$ is a child of $v$ in
$\mathcal{PT}$. If we leave such inconsistencies unresolved, the
resulting release may not be meaningful and therefore provides poor
utility.

Definition 4.1 (Consistency Constraint): In a non-noisy prefix tree,
there exist two sets of consistency constraints:

1) For any root-to-leaf path $p$, $\forall v_i \in p$,
$|tr(v_i)| \le |tr(v_{i+1})|$, where $v_i$ is a child of $v_{i+1}$;

2) For each node $v$, $|tr(v)| \ge \sum_{u \in children(v)} |tr(u)|$. ■

Our goal is to enforce such consistency constraints on the noisy
prefix tree in order to produce a consistent and more accurate
private release. We adapt the constrained inference technique
proposed in [26] to adjust the noisy counts of nodes in the noisy
prefix tree so that the constraints defined in Definition 4.1 are
respected. Note that the technique proposed in [26] cannot be
directly applied to our case because: 1) the noisy prefix tree has
an irregular structure (rather than a complete tree with a fixed
degree); 2) the noisy prefix tree has different constraints
$|tr(v)| \ge \sum_{u \in children(v)} |tr(u)|$ (rather than
$|tr(v)| = \sum_{u \in children(v)} |tr(u)|$). Consequently, we
propose a two-phase procedure to obtain a consistent estimate with
respect to Definition 4.1 for each node (except the virtual root) in
the noisy prefix tree $\mathcal{PT}$.

We first generate an intermediate estimate for the noisy count of
each node $v$ (except the virtual root) in $\mathcal{PT}$. Consider
a root-to-leaf path $p$ of $\mathcal{PT}$. Let us organize the noisy
counts of nodes $v_i \in p$ into a sequence
$S = \langle \tilde{c}(v_1), \tilde{c}(v_2), \ldots, \tilde{c}(v_{|p|}) \rangle$,
where $v_i$ is a child of $v_{i+1}$. Let $mean[i,j]$ denote the mean
of a subsequence of $S$,
$\langle \tilde{c}(v_i), \tilde{c}(v_{i+1}), \ldots, \tilde{c}(v_j) \rangle$,
that is, $mean[i,j] = \frac{\sum_{k=i}^{j} \tilde{c}(v_k)}{j-i+1}$.
We compute the intermediate estimates $\bar{S}$ by Theorem 4.2 [26].

Theorem 4.2: Let
$L_m = \min_{j \in [m,|p|]} \max_{i \in [1,j]} mean[i,j]$ and
$U_m = \max_{i \in [1,m]} \min_{j \in [i,|p|]} mean[i,j]$. Then
$\bar{S} = \langle L_1, L_2, \ldots, L_{|p|} \rangle
         = \langle U_1, U_2, \ldots, U_{|p|} \rangle$. ■
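The min-max formula for $L_m$ in Theorem 4.2 can be evaluated directly. The Python sketch below does so for the noisy counts along one root-to-leaf path, ordered from the leaf towards the root as in $S$; the function name `intermediate_estimates` and the use of prefix sums are choices made for this illustration. It runs in $O(|p|^2)$ time per path.

```python
def intermediate_estimates(noisy_counts):
    """Return [L_1, ..., L_|p|] for one path, ordered leaf to root."""
    n = len(noisy_counts)
    prefix = [0.0]
    for c in noisy_counts:
        prefix.append(prefix[-1] + c)

    def mean(i, j):
        # Mean of noisy_counts[i..j], 0-based and inclusive.
        return (prefix[j + 1] - prefix[i]) / (j - i + 1)

    # Inner maximum: the largest mean over subsequences ending at j.
    best_ending_at = [max(mean(i, j) for i in range(j + 1)) for j in range(n)]
    # Outer minimum over all end positions j >= m.
    return [min(best_ending_at[m:]) for m in range(n)]
```

For instance, the path counts (3, 1, 2) violate the first constraint of Definition 4.1 at the first two positions; the estimate pools them and yields (2, 2, 2), while an already consistent sequence such as (1, 2, 3) is returned unchanged.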

The result of Theorem 4.2 is the minimum $L_2$ solution [26] of $S$
that satisfies the first type of constraints in Definition 4.1.
However, a node $v$ in $\mathcal{PT}$ appears in
$|leaves(v, \mathcal{PT})|$ root-to-leaf paths, where
$leaves(v, \mathcal{PT})$ denotes the leaves of the subtree of
$\mathcal{PT}$ rooted at $v$, and therefore has
$|leaves(v, \mathcal{PT})|$ intermediate estimates, each being an
independent observation of the true count $|tr(v)|$. We compute the
consolidated intermediate estimate of $v$ as the mean of the
estimates, normally the best estimate for $|tr(v)|$ [41]. We denote
the consolidated intermediate estimate of $v$ by $\bar{c}(v)$.

After obtaining $\bar{c}(v)$ for each node $v$, we compute its
consistent estimate $\hat{c}(v)$ in a top-down fashion as follows:

\[
\hat{c}(v) =
\begin{cases}
\bar{c}(v) & \text{if } v \in level(1, \mathcal{PT}) \\
\bar{c}(v) + \min\left(0,
  \dfrac{\hat{c}(w) - \sum_{u \in children(w)} \bar{c}(u)}{|children(w)|}\right)
  & \text{otherwise}
\end{cases}
\]

where $w$ is the parent of $v$. It follows the intuition that the
observation $\sum_{u \in children(w)} \bar{c}(u) > \hat{c}(w)$ is
strong evidence that excessive noise is added to the children. Since
the magnitude of noise in $\hat{c}(w)$ is approximately
$|children(w)|$ times smaller than that in
$\sum_{u \in children(w)} \bar{c}(u)$, it is reasonable to decrease
the children's counts according to $\hat{c}(w)$. However, we never
increase the children's counts based on $\hat{c}(w)$ because a large
$\hat{c}(w)$ simply indicates that many trajectories actually
terminate at $w$. It is easy to see that the consistency constraints
in Definition 4.1 are respected among consistent estimates, and
therefore the proof is omitted here.
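The top-down adjustment can be sketched in Python as below. The tree is assumed to be given as a dictionary from each node to its list of children, together with a dictionary of consolidated intermediate estimates; these data structures and the function name `consistent_estimates` are illustrative assumptions.

```python
def consistent_estimates(children, c_bar, level_one):
    """Compute consistent estimates top-down.

    children:  dict mapping a node to the list of its child nodes
    c_bar:     dict mapping a node to its consolidated intermediate estimate
    level_one: nodes directly below the virtual root
    """
    c_hat = {}
    stack = []
    for v in level_one:
        c_hat[v] = c_bar[v]  # level-1 nodes keep their estimates
        stack.append(v)
    while stack:
        w = stack.pop()
        kids = children.get(w, [])
        if not kids:
            continue
        gap = (c_hat[w] - sum(c_bar[u] for u in kids)) / len(kids)
        shift = min(0.0, gap)  # only ever decrease the children's counts
        for v in kids:
            c_hat[v] = c_bar[v] + shift
            stack.append(v)
    return c_hat
```

After the adjustment the children of $w$ sum to $\min(\sum_u \bar{c}(u), \hat{c}(w))$, so the second constraint of Definition 4.1 holds at every node. With a parent estimate of 10 and children estimates 6 and 8, for example, both children are lowered by 2 to 4 and 6.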

Once we obtain the consistent estimate for each node, we can
generate the private release by a postorder traversal of
$\mathcal{PT}$. In Section V, we demonstrate that the constrained
inferences significantly improve the utility of sanitized data.
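The release step described in this subsection can be sketched as a short recursive postorder traversal. The sketch assumes each node is a dictionary with a location label, a consistent estimate and a list of children, and it rounds the number of trajectories terminating at a node to the nearest non-negative integer; the rounding rule and the name `generate_release` are assumptions of this example, not details given in the text.

```python
def generate_release(level_one_nodes):
    """Emit sanitized trajectories from a tree of consistent estimates.

    Each node is a dict: {"loc": label, "count": estimate, "children": [...]}.
    """
    release = []

    def visit(node, prefix):
        path = prefix + (node["loc"],)
        for child in node["children"]:  # postorder: children first
            visit(child, path)
        # Trajectories that stop at this node are those not passed
        # on to any child.
        ended_here = node["count"] - sum(ch["count"] for ch in node["children"])
        release.extend([path] * max(0, round(ended_here)))

    for node in level_one_nodes:
        visit(node, ())
    return release
```

Because the consistent estimates satisfy the second constraint of Definition 4.1, `ended_here` is non-negative before rounding; the `max(0, ...)` guard only matters if the function is applied to raw noisy counts.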

C. Analysis 

Privacy Analysis. Kifer and Machanavajjhala [18] point out that
differential privacy must be applied with caution. The privacy
protection provided by differential privacy relates to the data
generating mechanism and deterministic aggregate-level background
knowledge. In most cases, for example the STM case, the trajectories
in the raw database are independent of each other and therefore the
evidence of participation [18] of a record owner can be eliminated
by removing his record. Furthermore, we assume that no deterministic
statistics of the raw database will ever be released. Hence
differential privacy is appropriate for our problem. We now show
that Algorithm 1 satisfies $\epsilon$-differential privacy.

Theorem 4.3: Given the total privacy budget $\epsilon$, Algorithm 1
ensures $\epsilon$-differential privacy. ■

Proof. Algorithm 1 consists of two steps, namely
BuildNoisyPrefixTree and GeneratePrivateRelease. In the procedure
BuildNoisyPrefixTree, our approach appeals to the well-understood
query model to construct the noisy prefix tree $\mathcal{PT}$.
Consider a level of $\mathcal{PT}$. Since all nodes on the same
level contain disjoint sets of trajectories, according to Theorem
3.4 the entire privacy budget needed for a level is bounded by the
worst case, that is, $\bar{\epsilon} = \epsilon/h$. The use of
privacy budget on different levels follows Theorem 3.3. Since there
are at most $h$ levels, the total privacy budget needed to build the
noisy prefix tree is $\le h \times \bar{\epsilon} = \epsilon$.

For the procedure GeneratePrivateRelease, we make use of the
inherent constraints of a prefix tree to boost utility. The
procedure only accesses a differentially private noisy prefix tree,
not the underlying database. As proven by Hay et al. [26], a
post-processing of differentially private results remains
differentially private. Therefore, Algorithm 1 as a whole maintains
$\epsilon$-differential privacy. ■
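As a worked illustration of the budget accounting in the proof above, the following LaTeX fragment spells out the two composition steps and plugs in one setting used in the experiments ($\epsilon = 1.0$, $h = 12$). The level index $\ell$ is notation introduced here, and the last line assumes that the threshold equals twice the standard deviation $\sqrt{2}/\bar{\epsilon}$ of Laplace noise with scale $1/\bar{\epsilon}$.

```latex
\begin{align*}
\text{one level (Theorem 3.4, disjoint nodes):}\quad
  & \max_{u \in level(\ell, \mathcal{PT})} \bar{\epsilon}
    = \bar{\epsilon} = \frac{\epsilon}{h} \\
\text{all levels (Theorem 3.3):}\quad
  & \sum_{\ell=1}^{h} \bar{\epsilon}
    = h \cdot \frac{\epsilon}{h} = \epsilon \\
\epsilon = 1.0,\ h = 12:\quad
  & \bar{\epsilon} = \tfrac{1}{12} \approx 0.083, \qquad
    \theta = \frac{2\sqrt{2}}{\bar{\epsilon}} = 24\sqrt{2} \approx 33.9
\end{align*}
```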

Complexity Analysis. Algorithm 1 is of runtime complexity
$O(|\mathcal{D}| \cdot |\mathcal{L}|)$, where $|\mathcal{D}|$ is the
number of trajectories in the input database $\mathcal{D}$ and
$|\mathcal{L}|$ is the size of the location universe. This follows
from the following facts. For BuildNoisyPrefixTree, the major
computational cost is node generation and trajectory distribution.
For each level of the noisy prefix tree, the number of nodes to
generate approximates $k|\mathcal{D}|$, where $k \ll |\mathcal{L}|$
is a number depending on $|\mathcal{L}|$. For each level, we need to
distribute at most $|\mathcal{D}|$ trajectories to the newly
generated nodes. Hence, the complexity of constructing a single
level is $O(|\mathcal{D}| \cdot |\mathcal{L}|)$. Therefore, the
total runtime complexity of BuildNoisyPrefixTree for constructing a
noisy prefix tree of height $h$ is
$O(h|\mathcal{D}| \cdot |\mathcal{L}|)$. In GeneratePrivateRelease,
the complexity of calculating the intermediate estimates for a
single root-to-leaf path is $O(h^2)$. Since there can be at most
$|\mathcal{D}|$ distinct root-to-leaf paths, the complexity of
computing all intermediate estimates is $O(h^2|\mathcal{D}|)$. To
compute consistent estimates, we need to visit every node exactly
twice, which is of complexity
$O(|\mathcal{D}| \cdot |\mathcal{L}|)$. Similarly, the computational
cost of generating the private release by traversing the noisy
prefix tree once in postorder is
$O(|\mathcal{D}| \cdot |\mathcal{L}|)$. Since $h$ is a very small
constant compared to $|\mathcal{D}|$ and $|\mathcal{L}|$, the total
complexity of Algorithm 1 is $O(|\mathcal{D}| \cdot |\mathcal{L}|)$.

D. Extensions 

In some applications, trajectory data may have locations associated
with timestamps. The time factor is often discretized into intervals
at different levels of granularity, e.g., hour, which is typically
determined by the data publisher. All timestamps of a trajectory
database form a timestamp universe. This type of trajectory is
composed of a sequence of location-timestamp pairs in the form of
$loc_1 t_1 \to loc_2 t_2 \to \cdots \to loc_n t_n$, where
$t_1 \le t_2 \le \cdots \le t_n$.

Fig. 3. The distribution of trajectory length in the STM dataset

Our solution can be seamlessly applied to this type of trajectory
data. In this case, we can label each node in the prefix tree by
both a location and a timestamp. Therefore, two trajectories with
the same sequence of locations but different timestamps are
considered different. For example, $L_1T_1 \to L_2T_2$ is different
from $L_1T_2 \to L_2T_3$, and the corresponding prefix tree will
have two non-overlapping root-to-leaf paths. Consequently, when
constructing the noisy prefix tree, in order to expand a node
$loc_i t_i$, we have to consider the combinations of all locations
and the timestamps in the time universe that are greater than $t_i$
(because the timestamps in a trajectory are non-decreasing),
resulting in a larger candidate set. Due to the efficient
implementation we propose in Section IV-A, the computational cost
will remain moderate.
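A minimal Python sketch of the enlarged candidate set is given below. It follows the sentence above literally and keeps only timestamps strictly greater than the parent's; if equal timestamps are to be admitted, the comparison would become `>=`. The function name `timestamped_candidates` and the encoding of timestamps as comparable values are assumptions of this example.

```python
from itertools import product


def timestamped_candidates(locations, timestamps, parent_time=None):
    """Return all (location, timestamp) labels for the children of a node.

    parent_time is the timestamp of the node being expanded, or None
    when expanding the virtual root.
    """
    later = [t for t in timestamps if parent_time is None or t > parent_time]
    return list(product(locations, later))
```

For a universe of $|\mathcal{L}|$ locations and $T$ timestamps the candidate set has up to $|\mathcal{L}| \cdot T$ entries per node, almost all of them empty, which is why the separate treatment of empty nodes matters in this setting.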

V. Experimental Evaluation 

In this section, our objective is to examine the utility of
sanitized data in terms of count queries and frequent sequential
pattern mining, and to evaluate the scalability of our approach for
processing large-scale real-life trajectory data. In particular, we
study the utility improvements due to constrained inferences. In the
following, we refer to the approach without constrained inferences
as Basic and the one with constrained inferences as Full. We are not
able to compare our solution to others because no other existing
solution is able to process large-volume trajectory data in the
framework of differential privacy. Our implementation was done in
C++, and all experiments were performed on an Intel Core 2 Duo
2.26GHz PC with 2GB RAM.

A. Dataset 

We perform extensive experiments on the real-life trajectory dataset
STM, which is provided by the Societe de transport de Montreal
(STM). The STM dataset records the transit history of passengers in
the STM transportation network. Each trajectory in the dataset is
composed of a time-ordered list of stations visited by a passenger
within a week in 2009. The STM dataset contains 1,210,096 records
and has a location universe of size 1,012. The maximum trajectory
length $max|D|$ of the STM dataset is 121 and the average length
$avg|D|$ is 6.7. We show the distribution of trajectory length of
the STM dataset in Figure 3, which justifies our rationale for
selecting the $h$ value in Section IV-A. Note that similar
observations can be found in many real-life trajectory datasets; for
example, the MSNBC dataset³ has a maximum trajectory length that is
over 2500 times larger than the average trajectory length.

B. Utility 

Count Query. In our first set of experiments, we examine relative
errors of count queries with respect to two different parameters,
namely the privacy budget $\epsilon$ and the noisy prefix tree
height $h$. We follow the evaluation scheme from previous works
[21], [24]. For each privacy budget, we generate 40,000 random count
queries with varying numbers of locations. We call the number of
locations in a query the length of the query. We divide the query
set into 4 subsets such that the query length of the $i$-th subset
is uniformly distributed in $[1, \frac{ih}{4}]$ and each location is
randomly drawn from the location universe $\mathcal{L}$. The sanity
bound $s$ is set to 0.1% of the dataset size, the same as in those
works.

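The role of the sanity bound can be illustrated with a small Python helper. The formula used here, $|Q(\tilde{\mathcal{D}}) - Q(\mathcal{D})| / \max(Q(\mathcal{D}), s)$, is the definition of relative error commonly used with a sanity bound and is assumed rather than quoted from this section; the function name `relative_error` and its parameters are likewise hypothetical.

```python
def relative_error(true_answer, sanitized_answer, dataset_size,
                   sanity_fraction=0.001):
    """Relative error of one count query with a sanity bound.

    The sanity bound s = sanity_fraction * dataset_size keeps queries
    with very small true answers from dominating the average.
    """
    s = sanity_fraction * dataset_size
    return abs(sanitized_answer - true_answer) / max(true_answer, s)
```

With the STM dataset size of 1,210,096 the bound is about 1,210, so a query whose true answer is 2,000 and whose sanitized answer is 2,100 has a relative error of 5%, while a query with true answer 0 is measured against the bound instead of dividing by zero.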
Figure 4 examines the average relative errors under varying privacy
budgets from 0.5 to 1.5 while fixing the noisy prefix tree height
$h$ to 12. The X-axes represent the different subsets by their
maximum query length $max|Q|$. As expected, the average relative
error decreases when the privacy budget increases because less noise
is added and the construction process is more precise. In general,
our approach maintains high utility for count queries. Even in the
worst case ($\epsilon = 0.5$ and $max|Q| = 3$), the average relative
error of Full is still less than 12%. Under the typical setting of
the privacy budget $\epsilon = 1.0$, the relative error is less than
10% for all query subsets. Such a level of relative error is deemed
acceptable for data analysis tasks at the STM. The small relative
error also demonstrates the distinct advantage of a data-dependent
approach: noise never accumulates exponentially, as it does in a
data-independent approach.

Figure 4 also shows that the constrained inference technique
significantly reduces the average relative errors, with an
improvement of 30% to 74%. The constrained inference technique is
especially beneficial when the privacy budget is small.

Figure 5 studies how the average relative errors vary under
different noisy prefix tree heights $h$ with the query length fixed
to 6. We can observe that as $h$ increases, the relative errors do
not decrease monotonically. Initially, the relative error decreases
when $h$ increases because a larger $h$ allows more information to
be retained from the underlying database. However, after a certain
threshold, the relative error becomes larger with the increase of
$h$. The reason is that when $h$ gets larger, the privacy budget
assigned to each level becomes smaller, and therefore the noise
becomes larger. It is interesting to point out that a larger $h$
value does not always lead to a noisy prefix tree with a larger
height. Recall that a path of a noisy prefix tree needs to have a
sufficient number of trajectories to reach $h$ levels. As a result,
specifying a larger $h$ may not obtain more information from the
underlying database, but only decreases the privacy budget of each
level. This confirms our analysis in Section IV-A that in practice a
good $h$ value does not need to be very large. Moreover, from
Figure 5 we can learn that desirable relative errors can be achieved
within a relatively wide range of $h$ values (e.g., 10-16). Hence,
it is easy for a data publisher to select a good $h$ value.

³ MSNBC is publicly available at the UCI Machine Learning Repository
(http://archive.ics.uci.edu/ml/index.html).

In addition, from Figure 5 we can see that the benefits of the
constrained inference technique are two-fold. First, we obtain much
smaller relative errors in all settings. Second, the relative errors
become more stable with respect to varying height values.

Frequent Sequential Pattern Mining. In the second set of
experiments, we demonstrate the utility of sanitized data by a
concrete data mining task, frequent sequential pattern mining.
Specifically, we employ PrefixSpan to mine frequent sequential
patterns.⁴

Table III shows how the utility changes with different top $k$
values while fixing $\epsilon = 1.0$ and $h = 12$. When $k = 50$,
the sanitized data (under both Basic and Full) is able to give the
exact top 50 most frequent patterns. As $k$ increases, the accuracy
(the ratio of true positives to $k$) decreases. However, even when
$k = 250$, the accuracy of Full is still as high as
197/250 = 78.8%. In addition, we can observe that the constrained
inference technique again helps obtain more accurate results for
frequent sequential pattern mining. When $k = 250$, the improvement
due to constrained inferences is 10.4%.

⁴ An implementation of the PrefixSpan algorithm proposed in [42],
available at http://code.google.com/p/prefixspan/
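The accuracy measure used in these tables, the ratio of true positives to $k$, can be computed as in the following Python sketch. Patterns are assumed to be hashable (here, tuples of locations), and the function name `top_k_accuracy` is hypothetical.

```python
def top_k_accuracy(true_top_k, sanitized_top_k):
    """Fraction of the true top-k patterns found in the sanitized top-k."""
    k = len(true_top_k)
    true_positives = len(set(true_top_k) & set(sanitized_top_k))
    return true_positives / k
```

Since both lists have length $k$, the number of false positives (false drops) is simply $k$ minus the number of true positives.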

Table IV presents the utility for frequent sequential pattern mining
under different privacy budgets while fixing $h = 12$ and $k = 200$.
Generally, larger privacy budgets lead to more true positives and
fewer false positives (false drops). This conforms to the
theoretical analysis that a larger privacy budget results in less
noise and therefore a more accurate result. Since the most frequent
sequential patterns are short (typically of length at most 3), they
have large support in the underlying database. As a result, the
utility is insensitive to varying privacy budgets, and the accuracy
is high even when the privacy budget is small. When
$\epsilon = 0.5$, the accuracy of Full is 160/200 = 80%.
Furthermore, the results again show that constrained inferences are
helpful for frequent sequential pattern mining.

Table V studies how the utility varies under different noisy prefix
tree height values with $\epsilon = 1.0$ and $k = 200$. It is
interesting to see that under Basic the best result is obtained when
$h = 6$. This again confirms our analysis that truncating the
trailing parts of long trajectories has very limited impact on the
results of frequent sequential pattern mining. For small $h$ values,
a larger portion of the privacy budget is assigned to each level, so
the construction of each level is more precise and the accuracy is
even higher. With constrained inferences, we are able to obtain
better and, more importantly, more stable results. This is vital
because under the same $h$ value (e.g., 10-16) the release can
simultaneously maintain high utility for both count queries and
frequent sequential pattern mining.

TABLE IV. Utility for frequent sequential pattern mining vs. privacy
budget

C. Scalability 

In the last set of experiments, we examine the scalability of our
approach, one of the most important improvements over
data-independent approaches. According to the analysis in Section
IV-C, the runtime complexity of our approach is dominated by the
database size $|\mathcal{D}|$ and the location universe size
$|\mathcal{L}|$. Therefore, we study the runtime under different
database sizes and different location universe sizes for both Basic
and Full. Figure 6a presents the runtime under different database
sizes with $\epsilon = 1.0$, $h = 20$ and $|\mathcal{L}| = 1,012$.
The test sets are generated by randomly extracting records from the
STM dataset. We can observe that the runtime is linear in the
database size. When $|\mathcal{D}| = 1,200,000$, the runtime of Full
is just 24 seconds. Moreover, it can be observed that the
computational cost of constrained inferences is negligible. This
further confirms the benefit of conducting constrained inferences.
Figure 6b shows how the runtime varies under different location
universe sizes. For each universe size, we remove all locations
falling outside the universe from the STM dataset. This results in a
smaller database size. Consequently, we fix the database size for
all test sets to 800,000. Again, it can be observed that the runtime
scales linearly with the location universe size and that the
computational cost of constrained inferences is very small and
stable under different location universe sizes. In summary, our
approach is scalable to large trajectory datasets. It takes less
than 25 seconds to sanitize the STM dataset in all previous
experiments.

Fig. 6. Runtime vs. different parameters.

VI. Conclusion 

All existing techniques for privacy-preserving trajectory data
publishing are derived from partition-based privacy models, which
have been shown to fail to provide sufficient privacy protection. In
this paper, motivated by the STM case, we study the problem of
publishing trajectory data in the framework of differential privacy.
For the first time, we present a practical data-dependent solution
for sanitizing large-scale real-life trajectory data. In addition,
we develop a constrained inference technique to improve the
resulting utility. We stress that our approach applies to different
types of trajectory data. We believe that our solution could benefit
many sectors that face the dilemma between the demands of trajectory
data publishing and privacy protection. In future work, we intend to
investigate the utility of sanitized data on other data mining
tasks, for example, classification and clustering.